{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "6393d1c3-3ab9-45b9-970c-3d882c62d3cf",
   "metadata": {},
   "source": [
    "# Preprocessing of ULB Credit Card Dataset\n",
    "\n",
    "### Introduction\n",
    "We use the [ULB Credit Card Dataset](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) to train our fraud detection model. In this notebook, we preprocess the dataset and generate features, drawing on the following excellent work:\n",
    "\n",
    "* **Fraud detection handbook**: https://fraud-detection-handbook.github.io/fraud-detection-handbook/Foreword.html\n",
    "* **AWS creditcard fraud detector**: https://github.com/awslabs/fraud-detection-using-machine-learning/blob/master/source/notebooks/sagemaker_fraud_detection.ipynb\n",
    "* **Creditcard fraud detection predictive models**: https://www.kaggle.com/code/gpreda/credit-card-fraud-detection-predictive-models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aaf7c6b4-9e68-4411-96e2-5a04fe080752",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "from sklearn.model_selection import train_test_split\n",
    "from imblearn.over_sampling import SMOTE\n",
    "from collections import Counter"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ab933799-0669-49ad-8f25-e829279dac86",
   "metadata": {},
   "source": [
    "### Train and Test split\n",
    "\n",
    "We assume the credit card dataset has been downloaded from [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) and saved as `creditcard.csv` in the working directory. We now split it into a training set and a test set so that we can evaluate the performance of our models."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c65b6b89-e8e8-4b8a-b951-0a4e6b619b11",
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.read_csv('creditcard.csv', delimiter=',')\n",
    "print(data.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01f47caf-eac5-440f-87ed-69d3fc1f862f",
   "metadata": {},
   "outputs": [],
   "source": [
    "nonfrauds, frauds = data.groupby('Class').size()\n",
    "print('Number of frauds: ', frauds)\n",
    "print('Number of non-frauds: ', nonfrauds)\n",
    "print('Percentage of fraudulent data: {:.4f}%'.format(100 * frauds / (frauds + nonfrauds)))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34098328-5d82-4af2-9d02-1ad198a72315",
   "metadata": {},
   "outputs": [],
   "source": [
    "feature_columns = data.columns[:-1]\n",
    "label_column = data.columns[-1]\n",
    "\n",
    "features = data[feature_columns].values.astype('float32')\n",
    "labels = (data[label_column].values).astype('float32')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "01a43822-ac56-463b-9f45-a391dd3161ed",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(\n",
    "    features, labels, test_size=0.1, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8cadb773-4fcb-46d6-8b43-7842c5cdab32",
   "metadata": {},
   "outputs": [],
   "source": [
    "train = pd.DataFrame(np.column_stack([X_train, y_train]), columns = data.columns)\n",
    "test = pd.DataFrame(np.column_stack([X_test, y_test]), columns = data.columns)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7a5360cc-b1bb-446f-9809-728583170e0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sorted(Counter(y_train).items()))\n",
    "print(sorted(Counter(y_test).items()))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "796cdd60-3597-4880-82b5-1fd5055e9c67",
   "metadata": {},
   "source": [
    "### SMOTE\n",
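    "We use the [Synthetic Minority Over-sampling Technique (SMOTE)](https://arxiv.org/abs/1106.1813), which oversamples the minority class by interpolating new data points between existing minority-class samples. We apply it to the training set only, so the test set keeps the original class distribution."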
   ]
  },
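  {
   "cell_type": "markdown",
   "id": "smote-interpolation-sketch-note",
   "metadata": {},
   "source": [
    "The interpolation step can be illustrated with a minimal NumPy sketch. This is not how `imblearn` implements SMOTE internally; the toy points, the variable names (`x_i`, `x_nn`, `gap`) and the hard-coded neighbour are assumptions made for illustration only. A synthetic sample is placed at a random position on the line segment between a minority sample and one of its minority-class neighbours."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "smote-interpolation-sketch-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "# Toy illustration only: two hypothetical minority-class points in 2-D.\n",
    "rng = np.random.default_rng(42)\n",
    "x_i = np.array([1.0, 2.0])   # a minority sample\n",
    "x_nn = np.array([3.0, 1.0])  # one of its minority-class neighbours (assumed given)\n",
    "\n",
    "# Draw a random gap in [0, 1) and interpolate between the two points.\n",
    "gap = rng.random()\n",
    "x_new = x_i + gap * (x_nn - x_i)\n",
    "\n",
    "# The synthetic point lies on the segment between x_i and x_nn.\n",
    "print(x_new)"
   ]
  },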
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "34cd5f38-6524-48b9-a962-d4d1cbd46e03",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hyperparameters: target minority/majority ratio after resampling, and random seed\n",
    "sampling_ratio=0.1\n",
    "seed=42"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4f12b0da-128a-44d0-9ed5-a409485f779c",
   "metadata": {},
   "outputs": [],
   "source": [
    "smote = SMOTE(sampling_strategy=sampling_ratio, random_state=seed)\n",
    "X_smote, y_smote = smote.fit_resample(X_train, y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f7d15eab-21b9-4b59-92c3-d99c29dfadd9",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(sorted(Counter(y_smote).items()))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "84722c8e-ac3b-48ac-9bfd-6f3c4fc1fb39",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_smote = pd.DataFrame(np.column_stack([X_smote, y_smote]), columns = data.columns)\n",
    "train_smote"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a1998ad5-0dd9-41b4-b01e-d3ef626ee878",
   "metadata": {},
   "source": [
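    "As a sanity check, we compare the rows with a given `Time` value in the original data and in the **SMOTE**-resampled training set. Rows taken from the original data should match up to `float32` rounding, since SMOTE only adds synthetic minority-class samples."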
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "855f884f-38ef-4054-afa0-584bc85c6c13",
   "metadata": {},
   "outputs": [],
   "source": [
    "data[data['Time']==28515.000000]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "30455a57-a2f2-44e3-b3a7-948d0eba017a",
   "metadata": {},
   "outputs": [],
   "source": [
    "train_smote[train_smote['Time']==28515.000000]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "96601f8d-3f19-4fe7-89a7-642b702ef77b",
   "metadata": {},
   "source": [
    "We write the train, test, and SMOTE-resampled train sets to local CSV files, which can then be uploaded to Amazon S3 object storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e13bd551-9276-4c86-95aa-cb20d645230f",
   "metadata": {},
   "outputs": [],
   "source": [
    "train.to_csv('creditcard_train.csv', sep=',', index=False, encoding='utf-8')\n",
    "test.to_csv('creditcard_test.csv', sep=',', index=False, encoding='utf-8')\n",
    "train_smote.to_csv('creditcard_train_smote.csv', sep=',', index=False, encoding='utf-8')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "476d717b-abed-49bb-abcd-770837b1640e",
   "metadata": {},
   "source": [
    "Replace the `${MY_S3_BUCKET}` placeholder with your actual S3 bucket URI before executing the code cells below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f0e75695-ef89-4188-8e90-1938f0145577",
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 cp ./creditcard_train.csv ${MY_S3_BUCKET}/risk/ulb/\n",
    "!aws s3 cp ./creditcard_test.csv ${MY_S3_BUCKET}/risk/ulb/\n",
    "!aws s3 cp ./creditcard_train_smote.csv ${MY_S3_BUCKET}/risk/ulb/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1670d281-2539-4af3-8552-8787b4d84636",
   "metadata": {},
   "outputs": [],
   "source": [
    "!aws s3 ls ${MY_S3_BUCKET}/risk/ulb/"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23b6ea0a-0c31-4450-9a2d-cdee00d46858",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
